library(tidyverse)
demographics = read.csv('data/demos_anonymized.csv')
ids = read.csv('data/ids_anonymized.csv')
model_variables = read.csv('data/model_variables_anonymized.csv')
model_1 = lm(award_total_amount ~ libuser,
data = model_variables)
coef(model_1)
summary(model_1)
Note there are only two fitted values, one for library users and one for the non-users.
fitted(model_1)[1:10] # first 10 predictions on current data set
Use modelr to add them to your data set.
library(modelr) # need to install?
model_variables %>%
add_predictions(model_1, var = 'Prediction') %>%
group_by(libuser) %>%
summarise(result = mean(Prediction))
The broom library also has functionality for many common models.
library(broom) # need to install?
augment(model_1)
Use ggeffects for a quick visual!
library(ggeffects) # need to install?
ggpredict(model_1) %>%
plot()
We can assess performance: - within a model (e.g. Adjusted \(R^2\)) - by comparing models (e.g. anova test) - with prediction on new data (e.g. RMSE)
model_2 = lm(award_total_amount ~ libuser + gender,
data = model_variables)
anova(model_1, model_2)
Other approaches (all must be done on same data!):
adjr1 = summary(model_1)$adj
adjr2 = summary(model_2)$adj
round(c(adjr1, adjr2), 2)
AIC is uninterpretable on its own, but reflects the likelihood of the data given the model and complexity. When you have multiple AIC values, the lower value is the better model.
AIC(model_1, model_2)
model_logreg = glm(award_total_amount > 5e6 ~ age,
data = model_variables,
family = binomial)
summary(model_logreg)
The caret package has a nice function to summarize classification performance via a variety of metrics.
It needs labeled factors to use though. Often you would have this anyway.
library(caret) # need to install?
predictions = predict(model_logreg) > 0 # same as probability > .5
predictions = factor(predictions, labels = c('low', 'high'))
target = model_variables$award_total_amount > 5e6
target = factor(target, labels = c('low', 'high'))
confusionMatrix(predictions, target)
While Python has statistical modeling capabilities, it’s nowhere near the level of R. If you just want to stay in the Python world, you can do basic models without issue. However, it suffers from consistency of method, documentation, stability across updates etc. You might be better off simply calling R with something like rpy2 or a similar approach.
# note how when using something other than R, you have to specify the engine path
import pandas as pd
import numpy as np
import statsmodels
model_variables = pd.read_csv('data/model_variables_anonymized.csv')
Statsmodels is Python’s attempt to do statistical modeling. It essentially follows R’s formula style.
import statsmodels.api as sm
import statsmodels.formula.api as smf
# specify the model, then fit
model_1 = smf.ols('award_total_amount ~ libuser', data=model_variables)
results_1 = model_1.fit()
# Inspect the results
print(results_1.summary())
model_2 = smf.ols('award_total_amount ~ libuser + gender', data=model_variables)
results_2 = model_2.fit()
# Inspect the results
print(results_2.summary())
print(sm.stats.anova_lm(results_1, results_2))
model_variables['award_high'] = 1*(model_variables['award_total_amount'] > 5e6) # 1* makes it a numeric
logit_mod = smf.logit('award_high ~ libuser + gender', data = model_variables)
logit_result = logit_mod.fit()
print(logit_result.summary())
logit_result.pred_table() / model_variables.shape[0]